
Conversation

@halspang
Member

@halspang halspang commented Mar 5, 2025

This commit adds a timeout to the gRPC stream used to communicate with the backend. This was done because the backend could restart and drop the connection without the worker knowing, leaving the worker hung and unable to receive any new work items. The fix is to reset the connection if enough time has passed without receiving anything on the stream.
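For illustration, a minimal sketch of the idle-timeout pattern, assuming a linked CancellationTokenSource and a hypothetical streamTimeout value (the names here are illustrative, not the actual SDK members):

```csharp
// Sketch only: an idle timeout linked to the worker's shutdown token.
// If nothing arrives on the stream for `streamTimeout`, the linked token
// fires, the read loop ends, and the worker can rebuild the connection.
using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellation);
tokenSource.CancelAfter(streamTimeout);

await foreach (P.WorkItem workItem in stream.ResponseStream.ReadAllAsync(tokenSource.Token))
{
    tokenSource.CancelAfter(streamTimeout); // receiving anything resets the idle timer
    // ... dispatch the work item ...
}
```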


Signed-off-by: halspang <[email protected]>
@halspang halspang force-pushed the halspang/streaming_timeout branch from 14951b4 to a098203 Compare March 5, 2025 23:43
@nytian nytian requested a review from jviau March 5, 2025 23:55
```diff
 while (!cancellation.IsCancellationRequested)
 {
-    await foreach (P.WorkItem workItem in stream.ResponseStream.ReadAllAsync(cancellation))
+    await foreach (P.WorkItem workItem in stream.ResponseStream.ReadAllAsync(tokenSource.Token))
```
Member

Would this not throw if the connection is closed?

Member Author

We thought it would too, but if you go into the IAsyncStreamReader, a cancellation is actually just treated as the end of the stream and it returns normally.

```csharp
    }
}

if (tokenSource.IsCancellationRequested || tokenSource.Token.IsCancellationRequested)
```
Member

Maybe I am missing something, but it seems unlikely this line would ever be true. If IsCancellationRequested is set, then more likely than not stream.ResponseStream.ReadAllAsync would throw an OperationCanceledException.

Member Author

See above. We thought this behavior was an odd choice for the stream reader as well, but it's documented that it doesn't throw.
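
For reference, ReadAllAsync is essentially a thin wrapper over MoveNext, roughly like the following (a simplified approximation of Grpc.Core's AsyncStreamReaderExtensions, not the exact source):

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using Grpc.Core;

public static class StreamReaderSketch
{
    // Approximation only. Per the behavior described above, a cancelled token
    // surfaces here as MoveNext completing (end of stream) rather than as an
    // exception, so the consuming await-foreach simply exits the loop.
    public static async IAsyncEnumerable<T> ReadAllSketchAsync<T>(
        this IAsyncStreamReader<T> reader,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        while (await reader.MoveNext(cancellationToken).ConfigureAwait(false))
        {
            yield return reader.Current;
        }
    }
}
```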

@jviau
Member

jviau commented Mar 6, 2025

Can you explain the observed code flow when the scheduler shuts down? Is an exception thrown? If so, the call to ProcessWorkItemsAsync is wrapped in many catch statements - is one of them hit? Do we misconstrue this as a worker shutdown?

I would expect this line to be hit; is that not the case?

@halspang
Member Author

halspang commented Mar 6, 2025

> Can you explain the observed code flow when the scheduler shuts down? Is an exception thrown? If so, the call to ProcessWorkItemsAsync is wrapped in many catch statements - is one of them hit? Do we misconstrue this as a worker shutdown?

It doesn't throw an exception; it returns as if the stream had just ended normally. So, once the cancellation is triggered, the foreach loop exits and we check the token. If the token is cancelled, we return. If the overall cancellation token was cancelled, it will exit at that level as well. If not, it creates a new connection to the scheduler.

https://grpc.github.io/grpc/csharp/api/Grpc.Core.IAsyncStreamReader-1.html
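
Roughly, the reconnect flow described here looks like the following sketch (the helper name is hypothetical, not the actual SDK code):

```csharp
// Sketch of the outer loop: if the read method returns without the overall
// cancellation being requested, the stream ended (backend restart, dropped
// connection, or idle timeout) and a new connection is opened on the next pass.
while (!cancellation.IsCancellationRequested)
{
    // Hypothetical helper: opens a stream to the scheduler and reads work items
    // until the stream ends, the idle timeout fires, or the worker shuts down.
    await this.ConnectAndProcessWorkItemsAsync(cancellation);
}
```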

Member

@jviau jviau left a comment

Approving to unblock, but I think this is treating a symptom and not the underlying problem. If a scheduler restart isn't closing the connection to the worker, thus ending the stream, something else is keeping that alive (such as a reverse proxy). This needs to be looked at and addressed, as it is violating some fundamental expectations of gRPC streams.

I would feel slightly better if we make this privately configurable by the AzureManaged package somehow. We already have some pseudo-internal options they use here; we could add to that and have this behavior only enabled for DTS.

Signed-off-by: halspang <[email protected]>
@cgillum
Member

cgillum commented Mar 6, 2025

I think this is treating a symptom and not the underlying problem

Agreed, but I think this is a safety mechanism we want anyway, regardless of which gRPC server implementation we're targeting, so I'm happy to go with this for now. Warning logs have been added so that we can observe this behavior and be reminded that it still needs to be root caused.

@cgillum cgillum merged commit 7747663 into microsoft:main Mar 6, 2025
4 checks passed